Boiling down information retrieval test collections
Abstract
Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for constructing “minimal” information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be “boiled down.” More specifically, we report on experiments with test collections from both NTCIR and TREC to investigate the effect of reducing both the topic set size and the pool depth on the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and an error as a system pair for which the outcome of the statistical significance test differs from the original result based on the full test collection. Our main findings are: (a) cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) the NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, even though the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e., two systems are declared significantly different even though the full topic set would declare otherwise.
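To make the experimental design concrete, the following is a minimal Python sketch, not the authors' code, of one “boiling down” check for a single system pair. The per-topic scores are fabricated for illustration, a paired two-tailed t-test stands in for whichever significance test the paper uses, and only topic-set reduction is simulated; reducing pool depth would additionally change the relevance judgments, and hence the scores themselves, which this sketch does not model.

```python
# A minimal sketch (hypothetical data, not the authors' code) of one
# "boiling down" comparison: does a reduced topic set flip the verdict
# of a paired significance test between two systems?
import numpy as np
from scipy.stats import ttest_rel

def significantly_different(scores_a, scores_b, alpha=0.05):
    """Paired two-tailed t-test over per-topic effectiveness scores."""
    _, p = ttest_rel(scores_a, scores_b)
    return p < alpha

rng = np.random.default_rng(0)
n_topics = 100                                 # full topic set, as in the paper
scores_a = rng.uniform(0.2, 0.8, n_topics)     # fabricated per-topic scores
scores_b = scores_a + rng.normal(0.02, 0.1, n_topics)

# "Boil down" the topic set: keep a random 30 of the 100 topics
# (pool depth is held at 100 here).
subset = rng.choice(n_topics, size=30, replace=False)
full_verdict = significantly_different(scores_a, scores_b)
reduced_verdict = significantly_different(scores_a[subset], scores_b[subset])

# An "error" in the paper's sense: the reduced collection disagrees with
# the full collection about this system pair. Repeating this over many
# system pairs and reduction levels traces the cost/error trade-off,
# where cost = pool depth x topic set size (100 x 30 = 3000 here,
# against 100 x 100 = 10000 for the full collection).
print("error" if full_verdict != reduced_verdict else "agreement")
```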
Similar resources
Toward meaningful test collections for information integration benchmarking
Meaningful comparison of algorithms for search, extraction and aggregation across heterogeneous sources will require well-designed benchmarks, preferably based on freely available test collections. In this paper, we discuss issues which will inevitably arise during the construction of such benchmarks. We argue that the creation of benchmarks requires careful consideration of the evaluation metho...
Tracks and Topics: Ideas for Structuring Music Retrieval Test Collections and Avoiding Balkanization
This paper examines a number of ideas related to the construction of test collections for evaluation of music information retrieval algorithms. The ideas contained herein are not so much new as they are a synthesis of existing proposals. The goal is to create retrieval techniques which are as broadly applicable as possible, and the proposed manner for creating test collections supports this goal.
Building Reliable Test and Training Collections in Information Retrieval
Research in Information Retrieval has significantly benefited from the availability of standard test collections and the use of these collections for comparative evaluation of the effectiveness of different retrieval system configurations in controlled laboratory experiments. In an attempt to design large and reliable test collections, decisions regarding the assembly of the document corpus, the...
Evaluation Issues in Information Retrieval
Evaluation techniques are critical to research in all areas of science and engineering. The major evaluation techniques used in information retrieval today were put in place over 20 years ago when Cyril Cleverdon’s Cranfield projects started an evaluation pattern that still largely dominates the field, that is, using recall/precision measures and using one of the standard test collections. Wher...
Three Criteria for the Evaluation of Music Information Retrieval Techniques Against Collections of Musical Material
Evaluation of MIR systems requires honesty and skepticism with respect to the selection of test collections and the interpretation of results. We describe three minimal criteria which help to ensure this honesty and skepticism through appropriate selection of test collections, elimination of bias, and objective analysis of results.